An Exploratory Data and Network Analysis of Movies

Clarice, Daven, Lucia, Christopher and Indurain

September 6, 2019


Introduction

In this report, we analyse a dataset from Kaggle containing movies of different genres produced over many years. What makes this analysis interesting is that we can try to draw conclusions based on a movie’s popularity, the directors and actors involved, the year of production, and so forth. Moreover, we can construct networks in an attempt to find meaningful results. Inspecting a database of films from recent years raises several questions. A film may have a high rating yet a low return on investment (ROI). Which genre would you guess is the most successful? Which actors do you think are the most popular?

We have split our Exploratory Data Analysis into four main parts:

Section
1 Introducing the Data
- We first try to understand the data and look at its content.
2 Pre-Processing
- We look at what needs to be altered or removed from the dataset.
- We try to clean any dirty text.
- We try to minimise the dataset’s missing values.
3 Exploring the Data
- We conduct basic analysis on the dataset.
- We explore genres.
- We explore movie popularity.
- We look at the profit, gross, and return on investment of movies.
- We conduct more advanced analysis on the dataset.
4 Network Analysis
- We measure the network (centrality, degree distribution, number of components, average degree)
- We use network measures to highlight certain nodes (actors) and see which measures of an actor will increase ratings and budgets.

Admin

Before we start, let’s keep this code chunk for importing the correct libraries and loading the appropriate dataset. We use pacman to load the following:

We import the dataset like this:

In the next section we introduce our dataset and look at its content.


Introducing The Dataset

This section of the report is quite essential for our analysis. We cannot make any interesting inferences from the dataset if we do not know what is contained within it. In this section we will try to understand exactly what we are dealing with. Thereafter, we can begin to draw interesting results. We have already read in our dataset called movie_metadata, so we can see the following:

The dataset contains 28 unique columns/variables, each of which is described in the table below:

Variable Name              Description
color                      Specifies whether a movie is in black and white or color
director_name              Contains the name of the director of a movie
num_critic_for_reviews     Contains the number of critic reviews per movie
duration                   Contains the duration of a movie in minutes
director_facebook_likes    Contains the number of Facebook likes for a director
actor_3_facebook_likes     Contains the number of Facebook likes for actor 3
actor_2_name               Contains the name of the 2nd leading actor of a movie
actor_1_facebook_likes     Contains the number of Facebook likes for actor 1
gross                      Contains the amount a movie grossed in USD
genres                     Contains the sub-genres to which a movie belongs
actor_1_name               Contains the name of the actor in the lead role
movie_title                Title of the movie
num_voted_users            Contains the number of user votes for a movie
cast_total_facebook_likes  Contains the number of Facebook likes for the entire cast of a movie
actor_3_name               Contains the name of the 3rd leading actor of a movie
facenumber_in_poster       Contains the number of actors' faces on a movie poster
plot_keywords              Contains key plot words associated with a movie
movie_imdb_link            Contains the link to the IMDB movie page
num_user_for_reviews       Contains the number of user-generated reviews per movie
language                   Contains the language of a movie
country                    Contains the name of the country in which a movie was made
content_rating             Contains the maturity rating of a movie
budget                     Contains the amount of money spent in production per movie
title_year                 Contains the year in which a film was released
actor_2_facebook_likes     Contains the number of Facebook likes for actor 2
imdb_score                 Contains the user-generated rating per movie
aspect_ratio               Contains the aspect ratio of a movie
movie_facebook_likes       Number of likes of the movie on its Facebook page

Furthermore, the dataset contains 5043 movies, spanning 96 years and 46 countries. There are 1693 unique director names and 5390 actors/actresses. Around 79% of the movies are from the USA, 8% from the UK, and 13% from other countries.

The structure of the dataset can also be used to understand our data. We can run the following code chunk to see its structure.

## 'data.frame':    5043 obs. of  28 variables:
##  $ color                    : chr  "Color" "Color" "Color" "Color" ...
##  $ director_name            : chr  "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : chr  "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : chr  "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|Thriller" "Action|Thriller" ...
##  $ actor_1_name             : chr  "CCH Pounder" "Johnny Depp" "Christoph Waltz" "Tom Hardy" ...
##  $ movie_title              : chr  "Avatar " "Pirates of the Caribbean: At World's End " "Spectre " "The Dark Knight Rises " ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : chr  "Wes Studi" "Jack Davenport" "Stephanie Sigman" "Joseph Gordon-Levitt" ...
##  $ facenumber_in_poster     : int  0 0 1 0 0 1 0 1 4 3 ...
##  $ plot_keywords            : chr  "avatar|future|marine|native|paraplegic" "goddess|marriage ceremony|marriage proposal|pirate|singapore" "bomb|espionage|sequel|spy|terrorist" "deception|imprisonment|lawlessness|police officer|terrorist plot" ...
##  $ movie_imdb_link          : chr  "http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1" ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : chr  "English" "English" "English" "English" ...
##  $ country                  : chr  "USA" "USA" "UK" "USA" ...
##  $ content_rating           : chr  "PG-13" "PG-13" "PG-13" "PG-13" ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
##  $ title_year               : int  2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...

In the next section we start preparing the dataset for analysis by removing or simplifying some of the data.


Pre-Processing Data

In this part of the report we attempt to look for various things that may have a negative or insignificant impact on the inferences we make on the dataset. Once we have sufficiently cleaned and prepared the dataset, we can commence with drawing various conclusions from the graphs we generate.

Duplicate Rows

In movie_metadata, we have some duplicate rows, so we want to remove the 45 duplicated rows and keep the unique ones.

## [1] 45
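The duplicate check above can be sketched in base R as follows. This is a minimal illustration on a toy data frame, not the report's original chunk: the object name movies and its values are assumptions, and duplicated() is simply one standard way to flag rows that repeat an earlier row.

```r
# Toy stand-in for movie_metadata (hypothetical values, for illustration only)
movies <- data.frame(
  movie_title = c("Avatar", "Spectre", "Avatar", "Skyfall"),
  title_year  = c(2009, 2015, 2009, 2012),
  stringsAsFactors = FALSE
)

# Count rows that are exact copies of an earlier row
n_duplicates <- sum(duplicated(movies))

# Keep only the first occurrence of each row
movies_unique <- movies[!duplicated(movies), ]

print(n_duplicates)
print(nrow(movies_unique))
```

Note that duplicated() compares whole rows, so two entries for the same film that differ in any column are both kept.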

Missing Values

Let’s have a look at the number of NA values in our dataset:

##                     color             director_name 
##                         0                         0 
##    num_critic_for_reviews                  duration 
##                        49                        15 
##   director_facebook_likes    actor_3_facebook_likes 
##                       103                        23 
##              actor_2_name    actor_1_facebook_likes 
##                         0                         7 
##                     gross                    genres 
##                       874                         0 
##              actor_1_name               movie_title 
##                         0                         0 
##           num_voted_users cast_total_facebook_likes 
##                         1                         1 
##              actor_3_name      facenumber_in_poster 
##                         0                        13 
##             plot_keywords           movie_imdb_link 
##                         0                         0 
##      num_user_for_reviews                  language 
##                        22                         0 
##                   country            content_rating 
##                         0                         0 
##                    budget                title_year 
##                       488                       108 
##    actor_2_facebook_likes                imdb_score 
##                        14                         1 
##              aspect_ratio      movie_facebook_likes 
##                       328                         1

To help visualise this, have a look at the following heatmap of the missing values:

## 
##  Variables sorted by number of missings: 
##                   Variable       Count
##                      gross 0.174869948
##                     budget 0.097639056
##               aspect_ratio 0.065626251
##                 title_year 0.021608643
##    director_facebook_likes 0.020608243
##     num_critic_for_reviews 0.009803922
##     actor_3_facebook_likes 0.004601841
##       num_user_for_reviews 0.004401761
##                   duration 0.003001200
##     actor_2_facebook_likes 0.002801120
##       facenumber_in_poster 0.002601040
##     actor_1_facebook_likes 0.001400560
##            num_voted_users 0.000200080
##  cast_total_facebook_likes 0.000200080
##                 imdb_score 0.000200080
##       movie_facebook_likes 0.000200080
##                      color 0.000000000
##              director_name 0.000000000
##               actor_2_name 0.000000000
##                     genres 0.000000000
##               actor_1_name 0.000000000
##                movie_title 0.000000000
##               actor_3_name 0.000000000
##              plot_keywords 0.000000000
##            movie_imdb_link 0.000000000
##                   language 0.000000000
##                    country 0.000000000
##             content_rating 0.000000000
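Both missing-value summaries above can be reproduced with base R. The sketch below uses a toy data frame with assumed values. Note that the column labelled Count in the sorted output holds proportions rather than counts (for gross, 874 / 4998 ≈ 0.175), which is what colMeans() on the NA indicator returns.

```r
# Toy stand-in for the de-duplicated dataset (hypothetical values)
movies <- data.frame(
  gross      = c(100, NA, 300, NA),
  budget     = c(50, 60, NA, 80),
  imdb_score = c(7.1, 6.5, 8.0, 5.9)
)

# Number of NA values per column
na_counts <- colSums(is.na(movies))

# Share of NA values per column, largest first
na_share <- sort(colMeans(is.na(movies)), decreasing = TRUE)

print(na_counts)
print(na_share)
```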

Gross and Budget

Gross and budget have many missing values (874 and 488 respectively), and we want to keep both variables for the analysis that follows. Because imputation is unlikely to do a good job here, we delete the rows with missing values for gross or budget.

## [1] 3857   28
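One way to drop the rows described above is complete.cases() restricted to the two columns of interest. This is a sketch on toy data: the object names and values are assumptions, and the original chunk may have used a different function.

```r
# Toy stand-in for the dataset (hypothetical values)
movies <- data.frame(
  movie_title = c("A", "B", "C", "D"),
  gross       = c(100, NA, 300, 400),
  budget      = c(50, 60, NA, 80),
  stringsAsFactors = FALSE
)

# TRUE for rows where both gross and budget are present
keep <- complete.cases(movies[, c("gross", "budget")])

# Keep only those rows; NA values in other columns are left untouched
movies_gb <- movies[keep, ]

print(dim(movies_gb))
```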

The number of observations has decreased by 4998 - 3857 = 1141, which is 22.8% of the previous total. Let’s have a look at how many complete cases we have.

Content Rating

## 
##            Approved         G        GP         M     NC-17 Not Rated 
##        51        17        91         1         2         6        42 
##    Passed        PG     PG-13         R   Unrated         X 
##         3       573      1314      1723        24        10

Historically, the content ratings were renamed: M, GP and PG denote the same rating, as do X and NC-17. We therefore replace M and GP with PG, and X with NC-17, since these are the names in use today.

We replace Approved, Not Rated, Passed and Unrated with the most common rating, R.

## 
##           G NC-17    PG PG-13     R 
##    51    91    16   576  1314  1809

Blanks should be treated as missing values. Since these missing values cannot be replaced with reasonable data, we delete these rows.
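The recoding and the removal of blank ratings can be sketched in base R as follows. The toy values are assumptions chosen to cover each case; only the mapping itself (M and GP to PG, X to NC-17, the four unrated labels to R, blanks dropped) comes from the text above.

```r
# Toy stand-in covering every rating that gets recoded (hypothetical rows)
movies <- data.frame(
  content_rating = c("PG-13", "M", "GP", "X", "Approved",
                     "Not Rated", "", "R", "Unrated", "Passed"),
  stringsAsFactors = FALSE
)

rating <- movies$content_rating

# Older names for today's ratings
rating[rating %in% c("M", "GP")] <- "PG"
rating[rating == "X"] <- "NC-17"

# Labels without a modern equivalent go to the most common rating, R
rating[rating %in% c("Approved", "Not Rated", "Passed", "Unrated")] <- "R"

movies$content_rating <- rating

# Blank ratings are treated as missing and the rows are dropped
movies <- movies[movies$content_rating != "", , drop = FALSE]

rating_counts <- table(movies$content_rating)
print(rating_counts)
```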

Delete (Some) Rows

##                     color             director_name 
##                         0                         0 
##    num_critic_for_reviews                  duration 
##                         1                         0 
##   director_facebook_likes    actor_3_facebook_likes 
##                         0                         6 
##              actor_2_name    actor_1_facebook_likes 
##                         0                         1 
##                     gross                    genres 
##                         0                         0 
##              actor_1_name               movie_title 
##                         0                         0 
##           num_voted_users cast_total_facebook_likes 
##                         0                         0 
##              actor_3_name      facenumber_in_poster 
##                         0                         6 
##             plot_keywords           movie_imdb_link 
##                         0                         0 
##      num_user_for_reviews                  language 
##                         0                         0 
##                   country            content_rating 
##                         0                         0 
##                    budget                title_year 
##                         0                         0 
##    actor_2_facebook_likes                imdb_score 
##                         2                         0 
##              aspect_ratio      movie_facebook_likes 
##                        55                         0

We remove aspect_ratio because (1) it has a lot of missing values and (2) we will not be looking into the impact it has on other data (we assume that it has none).

Add a Column

Gross and Budget

We have gross and budget information, so let’s add two columns for further analysis: profit and percentage return on investment.
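A minimal sketch of the two derived columns follows. The column names profit and return_on_investment_perc, and the definition of ROI as profit divided by budget times 100, are assumptions; the report does not state the exact formula it used.

```r
# Toy stand-in with gross and budget in the same currency unit (hypothetical values)
movies <- data.frame(
  movie_title = c("A", "B"),
  gross       = c(300, 50),
  budget      = c(100, 200),
  stringsAsFactors = FALSE
)

# Profit: what the movie grossed minus what it cost
movies$profit <- movies$gross - movies$budget

# Percentage return on investment (assumed definition: profit relative to budget)
movies$return_on_investment_perc <- movies$profit / movies$budget * 100

print(movies)
```

Under this definition a movie that grosses three times its budget has an ROI of 200%, and a movie that loses money has a negative ROI.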

Remove (Some) Columns

Colour

Next, we take a look at the influence of colour vs black and white.

## 
##                   Black and White            Color 
##                2              124             3680

Since only about 3.3% of the movies (124 of 3806) are in black and white, the color variable is nearly constant and we can remove it.

Language

Let’s have a look at the different languages contained within the dataset.

## 
##            Aboriginal     Arabic    Aramaic    Bosnian  Cantonese 
##          2          2          1          1          1          7 
##      Czech     Danish       Dari      Dutch    English   Filipino 
##          1          3          2          3       3644          1 
##     French     German     Hebrew      Hindi  Hungarian Indonesian 
##         34         11          2          5          1          2 
##    Italian   Japanese     Kazakh     Korean   Mandarin       Maya 
##          7         10          1          5         14          1 
##  Mongolian       None  Norwegian    Persian Portuguese   Romanian 
##          1          1          4          3          5          1 
##    Russian    Spanish       Thai Vietnamese       Zulu 
##          1         24          3          1          1

Over 95% of the movies (3644 of 3806) are in English, which means this variable is nearly constant. Let’s remove it.
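The "nearly constant" argument used for color and language can be checked and applied with a few lines of base R. This sketch runs on toy data with assumed values; the helper expression that computes the share of the most frequent level is an illustration, not the report's original code.

```r
# Toy stand-in (hypothetical values)
movies <- data.frame(
  color      = c("Color", "Color", "Color", "Black and White"),
  language   = c("English", "English", "English", "French"),
  imdb_score = c(7.9, 7.1, 6.8, 8.5),
  stringsAsFactors = FALSE
)

# Share of rows taken by the most frequent level of each candidate column
share_top <- sapply(movies[c("color", "language")],
                    function(x) max(prop.table(table(x))))
print(share_top)

# Drop the columns judged to be nearly constant
movies <- movies[, !(names(movies) %in% c("color", "language")), drop = FALSE]
print(names(movies))
```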

Country

Next, we can look at the different countries.

## 
##    Afghanistan      Argentina          Aruba      Australia        Belgium 
##              1              3              1             40              1 
##         Brazil         Canada          Chile          China       Colombia 
##              5             63              1             13              1 
## Czech Republic        Denmark        Finland         France        Georgia 
##              3              9              1            103              1 
##        Germany         Greece      Hong Kong        Hungary        Iceland 
##             79              1             13              2              1 
##          India      Indonesia           Iran        Ireland         Israel 
##              5              1              4              7              2 
##          Italy          Japan         Mexico    Netherlands       New Line 
##             11             15             10              3              1 
##    New Zealand         Norway  Official site           Peru    Philippines 
##             11              4              1              1              1 
##         Poland        Romania         Russia   South Africa    South Korea 
##              1              2              3              3              8 
##          Spain         Taiwan       Thailand             UK            USA 
##             22              2              4            316           3025 
##   West Germany 
##              1

Around 79% of the movies are from the USA, 8% from the UK, and 13% from other countries. We therefore group the other countries together, giving a categorical variable with fewer levels: USA, UK, Others.

## 
## Others     UK    USA 
##    465    316   3025
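The grouping into three levels can be sketched with ifelse(). The toy country values are assumptions; the three target levels are those named in the text.

```r
# Toy stand-in (hypothetical values)
movies <- data.frame(
  country = c("USA", "UK", "France", "USA", "Germany"),
  stringsAsFactors = FALSE
)

# Keep USA and UK as they are; everything else becomes "Others"
movies$country <- ifelse(movies$country %in% c("USA", "UK"),
                         movies$country, "Others")

country_counts <- table(movies$country)
print(country_counts)
```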

Now that we’ve cleaned up our dataset, we can explore our data further. In the next section we will be looking at genres, movie popularity, gross, profit, and other aspects pertinent to our data.


Analysing Data

When inspecting a dataset of movies over the past few years, various interesting inferences can be uncovered. A movie may have a high rating yet low return on investment. Which genre is the most successful? Which actors are the most popular? These are some of the questions we aim to answer in this section.

We can start by performing basic analysis on our data. Thereafter, we delve a bit deeper into more specific parts of the dataset, in the hope of uncovering interesting observations.

Movie Genre Analysis

Now we can delve into more specific things regarding movies, like genres.

Split Genres

As the output below shows, each movie is associated with multiple genres. For analysis purposes, we choose to use the first genre listed in the genres column, as this is likely the most accurate description.

## [1] "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy"       
## [3] "Action|Adventure|Thriller"       "Action|Thriller"                
## [5] "Action|Adventure|Sci-Fi"         "Action|Adventure|Romance"

Let’s split the genres separated by “|” into 8 different columns.

genre_df consists of 8 columns, each with different genres. Let’s have a look at the frequency of all the genres.
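A base R sketch of the split follows. The toy genre strings and the column names genre_1 to genre_8 are assumptions; the code also assumes, as the text implies, that no movie lists more than 8 genres.

```r
# Toy genre strings in the same "|"-separated format (hypothetical rows)
genres <- c("Action|Adventure|Fantasy|Sci-Fi", "Action|Thriller",
            "Drama", "Comedy|Drama")

# Split on the literal "|" character (fixed = TRUE avoids regex alternation)
parts <- strsplit(genres, "|", fixed = TRUE)

# Pad every movie to 8 slots with NA so the result is rectangular
# (assumes at most 8 genres per movie)
n_cols <- 8
genre_mat <- t(sapply(parts, function(p) {
  c(p, rep(NA_character_, n_cols - length(p)))
}))

genre_df <- as.data.frame(genre_mat, stringsAsFactors = FALSE)
names(genre_df) <- paste0("genre_", seq_len(n_cols))

# Frequency of every genre across all slots, most common first
genre_freq <- sort(table(unlist(parts)), decreasing = TRUE)
print(genre_freq)
```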

It is evident that Drama and Comedy are the most frequently produced genres. However, this does not mean that they are the most profitable or that they return the highest ROIs. This is explored further.

Previously we assumed that the first genre is the most applicable, therefore, we choose the first column as the genre for the movie and append it to the dataframe.
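Taking the first listed genre does not require the full split: everything from the first "|" onwards can be removed with sub(). In this sketch the column name main_genre and the toy rows are assumptions.

```r
# Toy stand-in (hypothetical rows)
movies <- data.frame(
  movie_title = c("A", "B", "C"),
  genres      = c("Action|Adventure|Fantasy|Sci-Fi", "Drama", "Comedy|Romance"),
  stringsAsFactors = FALSE
)

# Strip the first "|" and everything after it, leaving the first genre
movies$main_genre <- sub("\\|.*$", "", movies$genres)

print(movies$main_genre)
```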

What does this distribution look like over the years? Let’s have a look at the frequency of genres between 1980 and 2016.

The heat map shows that the popular, high-frequency genres are being produced more and more often over the years. This is evident from the darker shades of blue becoming more prominent in the later years.
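The counts behind such a heat map are a genre-by-year contingency table. The sketch below builds one on toy data: the column name main_genre and the values are assumptions, and the plotting step is left out since the report's plotting code is not shown.

```r
# Toy stand-in with one genre per movie (hypothetical rows)
movies <- data.frame(
  main_genre = c("Drama", "Comedy", "Drama", "Action", "Drama", "Comedy"),
  title_year = c(1975, 1985, 1999, 2010, 2010, 2016),
  stringsAsFactors = FALSE
)

# Restrict to the period shown in the heat map
recent <- movies[movies$title_year >= 1980 & movies$title_year <= 2016, ]

# Rows are genres, columns are years, cells are movie counts
genre_by_year <- table(recent$main_genre, recent$title_year)
print(genre_by_year)
```

Each cell of genre_by_year corresponds to one tile of the heat map, with larger counts drawn in darker shades.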

Popularity Analysis

IMDB ratings VS Movie Count

The histogram shows that the majority of movies are rated between 6 and 8 out of 10. Ratings below this range point to a poorly received movie, while ratings above it indicate an exceptionally well-received one.

Popularity over the years

There is a clear spike in the popularity score in 2004. The launch of Facebook that year, and the growing effect of social media, may have influenced this.

Facebook Likes VS IMDB Score

Movies with higher ratings tend to have more Facebook likes. This is probably because a good rating from critics makes people more eager to see the movie and to find out more about it.

Top 20 directors with highest average IMDB score

director_name          avg_imdb
Tony Kaye              8.600000
Damien Chazelle        8.500000
Majid Majidi           8.500000
Ron Fricke             8.500000
Christopher Nolan      8.425000
Asghar Farhadi         8.400000
Marius A. Markevicius  8.400000
Richard Marquand       8.400000
Sergio Leone           8.400000
Lee Unkrich            8.300000
Lenny Abrahamson       8.300000
Pete Docter            8.233333
Hayao Miyazaki         8.225000
Joshua Oppenheimer     8.200000
Juan José Campanella   8.200000
Quentin Tarantino      8.200000
David Sington          8.100000
Je-kyu Kang            8.100000
Terry George           8.100000
Tim Miller             8.100000
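A ranking like the one above can be produced with aggregate() and order(). This is a sketch on toy data: the director names and scores are placeholders, and only the column names director_name, imdb_score and avg_imdb follow the report.

```r
# Toy stand-in (hypothetical directors and scores)
movies <- data.frame(
  director_name = c("A", "A", "B", "C", "C", "C"),
  imdb_score    = c(8.0, 9.0, 7.0, 6.0, 8.0, 7.0),
  stringsAsFactors = FALSE
)

# Mean IMDB score per director
avg <- aggregate(imdb_score ~ director_name, data = movies, FUN = mean)
names(avg)[2] <- "avg_imdb"

# Sort from highest to lowest average and keep at most 20 directors
avg <- avg[order(-avg$avg_imdb), ]
top_directors <- head(avg, 20)

print(top_directors)
```

A plain average gives a director with a single highly rated film the same weight as one with many films, so a minimum film count could be added as a filter before ranking.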

The IMDB rating system started in the 1990s. Social media platforms like Facebook started in the mid-2000s.

Vote Counts VS IMDB score


Network Analysis